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Factoring Symmetric Indefinite Matrices 
on High-Performance Architectures 

Mark T. Jones*and Merrell L. Patrick** 


Abstract 

The Bunch-Kaufman algorithm is the method of choice for factor- 
ing symmetric indefinite matrices in many applications. However, the 
Bunch-Kaufman algorithm does not take advantage of high-performance 
architectures such as the Cray Y-MP. Three new algorithms, based 
on Bunch-Kaufman factorization, that take advantage of such archi- 
tectures are described. Results from an implementation of the third 
algorithm are presented. 
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1 Introduction 


The Bunch-Kaufman algorithm is considered one of the best methods for 
factoring full, symmetric, indefinite matrices [1], [2], A modified version has 
been successfully used to factor sparse, indefinite matrices [3]. Recently, 
Bunch-Kaufman factorization has been shown to be the method of choice for 
a subset of banded, symmetric indefinite matrices [4]. 

The Bunch-Kaufman algorithm maintains the symmetry of the matrix 
during factorization and yields the inertia of the matrix essentially for free, 
an important consideration for eigenvalue calculations [1]. A drawback to the 
Bunch-Kaufman algorithm is its unsuitability for high-performance architec- 
tures. Herein, three new algorithms, based on Bunch-Kaufman factorization, 
are given for architectures such as the Cray Y-MP. 

The technique of loop unrolling for vector architectures is discussed in 
section 2. In section 3, one of several variations of the Bunch-Kaufman al- 
gorithm is reviewed and the reason for its unsuitability for high-performance 
architectures is given. Three new algorithms for high-performance architec- 
tures are developed in section 4. Results showing the benefits of the third 
algorithm are given in section 5. Finally, a summary and description of future 
work is given in section 6. 


2 Loop-Unrolling 

Loop unrolling is a well known technique for improving performance on vector 
architectures. A loop is unrolled by restructuring it to allow more compu- 
tation to take place at each step. A simple example of loop unrolling from 
[5] is given in Figure 1. The outer DO-loop has been unrolled to a depth of 
four. In the original program segment, three vector memory references were 
required for every two vector floating point operations. The ratio for the un- 
rolled program segment is six vector memory references for every eight vector 
floating point operations. A significant decrease in the number of memory 
references has been achieved. 

The reduction in the number of vector memory operations reduces the 
probability of delays due to memory latency times as well as the possibility 
of memory contention in a parallel computer [6]. 

Three other benefits of loop unrolling are described in [7]. The first is a 
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C Original program segment 
DO 20 J = 1, N2 
DO 10 I = 1, N1 
Y(I) = Y(I) + X(J) * M(I,J) 

10 CONTINUE 

20 CONTINUE 

C In this example, the end condition if N2 isn’t a multiple of four is ignored 
DO 20 J = 4, N2, 4 
DO 10 I = 1, N1 

Y(I) = Y(I) + X(J-3) * M(I,J-3) + X(J-2) * M(I,J-2) + 
c X(J-l) * M(I,J-1) + X(J) * M(I,J) 

10 CONTINUE 

20 CONTINUE 

Figure 1: Simple loop unrolling example 

reduction in loop overhead because fewer incrementing and testing operations 
are required. This benefit can be reaped by any computer architecture. 

For computers with segmented functional units, such as the CDC 7600, 
the higher ratio of floating point operations to overhead operations will allow 
a higher level of concurrency within a functional unit. 

Computers with independent functional units, such as the Cray-1, benefit 
from greater concurrency between the functional units. 

The optimal depth of loop unrolling is largely dependent on the target 
architecture. For example, if the independent functional units of a computer 
are kept busy with loop unrolling of depth p, then increasing the depth to 
p -(- 1 will not result in increased concurrency among functional units. 

In the simple example in Figure 1, the results of iteration j of the outer 
loop did not depend on results of previous iterations. Therefore, the outer 
loop was easily unrolled. If LDL T decomposition is considered, however, each 
iteration of the outer loop depends on the previous iterations (see Figure 2). 
Unrolling the outer loop causes several pivot columns to be used simultane- 
ously to update the remaining non-zeroes. For the algorithm to be correct, 
the first pivot column must be used to update the other pivot columns, then 
the second pivot column used to update the remaining pivot columns, and so 
forth. After all the pivot columns are updated, they are used to update the 
remaining non-zeroes. Conceptually, loop unrolling in LDL ^ allows the use 
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1) DO 10 I = 1, N 

C Compute the pivot column 

2) DO 20 J = 1+1, N 

3) V(J) = A( J ,1) 

4) A(J,I) = A(J,I)/A(I,I) 

5) 20 CONTINUE 

C Update the remaining non-zeroes 

6) DO 30 J = 1+1, N 

7) DO 40 K = J, N 

8) A(J,K) = A(J ,K) - V(K)*A(J,I) 

9) 40 CONTINUE 

10) 30 CONTINUE 

11) 10 CONTINUE 

Figure 2: The LDL T algorithm 

of matrix- matrix operations rather than matrix- vector operations. A version 
of LDL t unrolled to a depth of three is given in Figure 3. 

Because three pivot columns are used to update the remaining non-zeroes 
in step 20, each time an element, a jik , is fetched six floating point computa- 
tions are done, rather than just two as in step 8 of the original algorithm. 


3 The Bunch-Kaufman Algorithm 

The Bunch-Kaufman algorithm factors A, an n x n real symmetric indefinite 
matrix, into LDL T while doing symmetric permutations on A to maintain 
stability, resulting in the following equation: 

PAP t = LDL T . (1) 

Although several variations of the algorithm exist, the focus here is on algo- 
rithm D from [1] because it is the simplest to discuss. The methods described 
in section 4 are also applicable to Algorithm A described in [1], but not to 
Algorithm C (Algorithm B is mentioned in [1], but it is not described in 

detail). 

The Bunch-Kaufman algorithm maintains stability by using 2x2 pivots 
combined with symmetric permutations to A when a lxl pivot is not stable. 
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C Loop unrolled version 

C The end condition for handling N not divisible by 3 has been ignored 

1) DO 10 I = 1, N, 3 

C Compute the upper 3x3 triangle of the pivot columns 

2) V1(I+1) = A(I+1,I) 

3) A(I+1,I) = A(I+1,I)/A(I,I) 

4) Vl(I+2) = A(I+2,I) 

5) A(I+2,I) = A(I+2,I)/A(I,I) 

6) A(I+1,I+1) = A(I+1,I+1) - V1(I+1)*A(I+1,I) 

7) V2(I+2) = A(I+2,I+1) - V1(I+1)*A(I+2,I) 

8) A(I+2,I+1) = V2(I+2)/A(I+l,I+l) 

9) A(I+2,I+2) = A(I+2,I+2) - V1(I+2)*A(I+2,I) - V2(I+2)*A(I+2,I+1) 
C Update and compute all three pivot columns 

10) DO 20 J = 1+3, N 

11) V1(J) = A(J,I) 

12) A(J,I) = A(J,I)/A(I,I) 

13) V2(J) = A(J,I+1) - V1(I+1)*A(J,I) 

14) A(J,I+1) = V2(J)/A(I+1,I+1) 

15) V3(J) = A(J,I+2) - V1(I+2)*A(J,I) - V2(I+2)*A(J,I+1) 

16) A(J,I+2) = A(J,I+2)/A(I+2,I+2) 

17) 20 CONTINUE 

C Use the 3 pivots columns to update the remaining non-zeroes 

18) DO 30 J = 1+3, N 

9) DO 40 K = J, N 

20) A(J,K) = A(J,K) - V1(K)*A(J,I) - V2(K)*A(J,I+1) - 

C V3(K)*A(J,I+2) 

21) 40 CONTINUE 

22) 30 CONTINUE 

23) 10 CONTINUE 

Figure 3: LDL T unrolled to a depth of three 
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Because this paper will concentrate on lxl pivots, only stability for these 
pivots will be discussed in detail. A lxl pivot for element at step k takes 
the form 

j ^ . ( 2 ) 

Let (i k be the maximum of the absolute values of the uneliminated elements 
at step k. Step 2 of the algorithm (shown in Figure 4) finds the maximum 
element, A, in the pivot column. By substituting ji and A into equation 2, 
the bound on fik+x becomes 

Mfc +1 < Hk + A 2 / I a k>k |< n k (l + A/ I a kk |). (3) 

Step 4 ensures that a lxl pivot occurs if a <\ a^i | /A, where the parameter 
a has been chosen to be 0.525 to maximize stability for Algorithm D [1]. By 
substituting a into equation 3, 

Hk+i < M 1 + i/«)- ( 4 ) 

Therefore, the bound on the growth of an element due to a lxl pivot is 2.905. 

If the test in step 4 is failed, a subsequent row search and another stability 
test determines if a 2x2 pivot and a permutation are necessary. 

The stability checks and possible permutations at each step of the Bunch- 
Kaufman algorithm prevent the use of the same type of loop unrolling that 
is used for LDL T decomposition. Because the stability checks and permuta- 
tions must be completed before a pivot column is computed, pivot columns 
cannot be grouped as they were in Figure 2 without invalidating the bounds 
on element growth. 

4 New Algorithm 

This section will develop three ways in which the Bunch-Kaufman algorithm 
can be modified to allow pivot columns to be grouped together in one step. 
Because each 2x2 pivot involves a permutation of A , it is not possible to group 
2x2 pivots together. However, lxl pivots can be grouped with a 2x2 pivot if 
they follow the 2x2 pivot, allowing the permutation to precede the updating 
and computing of pivot columns. Each 2x2 pivot can be implemented using 
loop unrolling of depth 2. The general strategy in this section will be to 
try to group several lxl pivots into a single step in a stable fashion. The 
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1) for i = 1, to 

begin 

2) A = maxj =i+ i |fl j a jti | 

3) set r to the row number of A 

4) if A a < | aij | then 
begin 

5) perform a lxl pivot 
end 

else 

begin 

6) a = max J= j +1 , n | a r<j \ 

7) if a A 2 < a \ a*,* | then 
begin 

8) perform a lxl pivot 

end 

else 

begin 

9) exchange rows and columns r and i + 1 

10) perform a 2x2 pivot 

11) i = i + 1 
end 

end 

12) end 

13) if inertia is desired, then scan the D matrix 

Figure 4: The Bunch-Kaufman Factorization Algorithm 
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strategies described in this section are applicable to Algorithms A and D> 
but not C from [1]. Algorithm C cannot be unrolled using these strategies 
because a permutation occurs at every step. 

The first approach uses pxp pivots in much the same way as the 2x2 pivots 
in the Bunch-Kaufman algorithm. The update of a single element of A using 
a 2x2 pivot at step k is 


( a i t k a k+l t k+l ~ &i t k+l a k+l t k) a j,k + {ai t k+l a k t k a i t k a k+l t k) a j,k+l 


CLi A — fit 7 

To obtain a bound for element growth, first define at step k 


(5) 


A x = maxi = k+i,n | o.i, k | 


( 6 ) 


and, 


A 2 — maxi = k+ 2 ,n | Ql'k+l | • 


If /xa, Ai, and A 2 are substituted into equation 5, then 


Pk +2 < Pk{ 1 + 


A* + Aj + 2A1A2 v 
2 1 1 

0 *,*0i+l,i+l — a i+l,i I 


(7) 

( 8 ) 


The bound on growth of p k + 2 for a 2x2 pivot in Algorithm D is 8.526 [1]. 
Therefore, a 2x2 pivot can be performed if 

1 + , Ih+hZj < 8.526. (9) 

This derivation is similar to the 2x2 pivot stability analysis in [1]. This test 
differs from the Bunch-Kaufman 2x2 test because it groups two potential 
lxl pivots into one step. The Bunch-Kaufman 2x2 test is used when a lxl 
pivot is not stable. The new test precedes step 4 of the Bunch-Kaufman 
algorithm in Figure 4. This approach has two drawbacks: 1) the formulae 
for bounding the growth due to a pxp pivot become increasingly complicated 
as p increases, and 2) the inertia calculation is no longer just a search down 
the diagonal, it requires the solution of many pxp eigenproblems. 

The second approach updates the potential pivot columns one at a time 
and, after each update, applies the lxl pivot stability check to determine if 
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4a) if A a < | a t> ; | then 
begin 

4b) compute the zth pivot column and use it to update column i + 1 

4c) A 2 = TTldX 2,n | | 

4d) if A 2 ct >| a,i+i t i+i | then 
begin 

4e) compute the ( z + l)th pivot column and use it and column i to 
update column i + 2 

4f) -^3 = j_|_3 jn | &j t i+2 | 

4g) if A 3 a >| a ;+ 2 ,*+2 I then 
begin 

4h) use columns i, i + 1, and z + 2 to update the remaining 

non-zeroes 
end 
else 
begin 

4i) use columns z and i + 1 to update the remaining non-zeroes 

end 
end 
else 
begin 

4j) perform a lxl pivot 
end 
end 


Figure 5: Approach Two for Loop Unrolling 
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further unrolling is possible. This test replaces steps 4 and 5 of the Bunch- 
Kaufman algorithm. An example of this test for up to three columns is given 
in Figure 5. 

The third strategy uses an a priori approach to predict stability. The 
strategy predicts the stability of grouping p lxl pivots without updating 
each potential pivot column. Only the upper pxp triangle must be updated 
to bound element growth. From equation 4, for p successive lxl pivots to 
maintain stability, the maximum element growth must be bounded by 


(1 + 1 /a) p . 

(10) 

At step i fc, let 

A 2 = max J== fc + 2,n | a j t k + i | • 

(11) 

The bound on element growth for a lxl pivot is (1 + A/a fc|fc ), therefore the 
bound on A 2 after a lxl pivot is 

^2 < ^2(1 + 

(12) 

Because the upper pxp triangle has been updated, a bound 
second lxl pivot using column k + 1 is 

on fi k + 2 for a 

Mfc+2 < M*+i(l + ^/ajfc+i.fc+i)- 

(13) 

By substituting the bound for /z* +1 into equation 13 


MAH-2 < ^k{ 1 + Va*,A:)(l + 'Wa*:+l > A:+l)* 

(14) 

In general, the bound on pk+ P -i for p lxl pivots is 


^ /i 1 ftk+p— 2^fc-J-p— 1 \ 

MA:+p-l < Pk+p- 2(1 + ), 

a fc+p-l,A:+p-l 

(15) 

where 


^ k+p-l ~ rriax j=k+ Pi n | a j t k+p - 1 1 * 

(16) 


Given the bound for p — 1 lxl pivots, the only new information necessary 
is the updated value of ak+ p -i t k+ P -i and \k+ p -i . If the bound for fik+( P -i) 
is small enough then the p pivot columns are computed at the same time 
and all used at the same time to update the remaining non-zeroes. The a 
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priori strategy has two advantages over the second strategy. First, it allows 
all the pivot columns to be updated and computed in one loop. The second 
strategy requires the pivot columns to be updated and computed one after 
the other. Therefore, the a priori method reaps the benefits of loop unrolling 
in the pivot column calculation. Second, the a priori method can combine a 
pivot that fails the lxl pivot test with a lxl pivot that is very stable if the 
combination of the two meets the stability criterion. The second method is 
unable to combine the two pivots if one of the pivots fails the lxl stability 
test. Another benefit the a priori method reaps from this combination is 
avoiding the search for cj in step 6 of the Bunch-Kaufman algorithm. A 
disadvantage of the a priori method when compared with the second method, 
is the use of estimated bounds. In some cases these bounds are too pessimistic 
and thereby prevent the combining of lxl pivots when the combination is 
stable. Because the second method actually computes the pivot columns it 
does not have to use estimated bounds. 


5 Results 

A version of the a priori algorithm described in section 4 suitable for variable 
banded systems was implemented on a Cray Y-MP. The uniprocessor imple- 
mentation allows loop unrolling up to depth six to take place. When the 
maximum depth of loop unrolling is fixed at one, this algorithm is identical 
to the Bunch-Kaufman algorithm. The Cray Y-MP is a register- to-register 
parallel/ vector computer with up to eight processors. Each processor has 
independent, segmented functional units. An indefinite matrix that arose 
from an application in structural engineering was factored and the resulting 
triangular matrices were solved. The order of the matrix was 10,785 and the 
average bandwidth was 416. A significant reduction in factorization time and 
solution time for the resulting triangular systems, due to increasing depths 
of loop unrolling is shown in Figure 6. For this matrix, the combination of 
six pivot columns never met the stability criterion. 

The benefits of the algorithm will vary depending on the architecture 
and on the matrix being solved. For example, the same implementation was 
executed on the Convex C-220, an architecture with independent, segmented 
functional units. An indefinite matrix arising from the same application was 
factored, but with an order of 6734 and an average bandwidth of 336. The 
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Depth 

Factorization 
Time (sec.) 

Speedup over 
B unch- K aufman 

Triangular Solution 
Time (sec.) 

Speedup over 
B unch- K aufman 

i 

16.3 

1.00 

0.32 

1.00 

2 

13.4 

1.22 

0.23 

1.39 

3 

12.9 

1.26 

0.21 

1.52 

4 

12.6 

1.29 

o.2n 

1.52 

5 

12.8 

1.27 

0.20 

1.60 


Figure 6: Effects of different depths of loop unrolling on the Cray Y-MP 


Depth 

Factorization 
Time (sec.) 

Speedup over 
B unch- K aufman 

Triangular Solution 
Time (sec.) 

Speedup over 
Bunch-Kaufman 

1 

71.77 

1.00 

1.30 

1.00 

2 

50.99 

1.41 

1.05 

1.24 

3 

46.85 

1.53 

0.98 

1.33 

4 

45.37 

1.58 

0.96 

1.35 

5 

45.15 

1.59 

0.97 

1.34 


Figure 7: Effects of different depths of loop unrolling on the Convex C-220 

maximum speedup due to the a priori algorithm is shown in Figure 7 to be 
significantly better for this combination of architecture and matrix than for 
the combination in Figure 6. 

Observing the number of times each depth of loop unrolling is utilized 
can be useful in determining the best maximum depth of loop unrolling for 
a particular application. Shown in Figure 8 is the number of times that 
each depth of loop unrolling was utilized when factoring the same matrix 
used in the Convex C-220 test. These results will, of course, be different for 
every matrix factored, but may be similar for matrices arising from the same 
application. 
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Maximum 

No. of 

No. of 

No. of 

No. of 

No. of 

No. of 

Possible Depth 

Depth 1 


Depth 3 

Depth 4 

Depth 5 

Depth 6 

i 

6732 

1 

- 

- 

- 

- 

2 

658 

3038 

- 

- 

- 

- 

3 

516 

1195 

1276 

- 

- 

- 

4 

543 

992 

429 

730 

- 

- 

5 

561 

874 

439 

432 

276 

- 

6 

561 

874 

439 

432 


0 


Figure 8: Depths of loop unrolled achieved 

6 Summary and Future Work 

Three algorithms, each based on the Bunch-Kaufman algorithm, suitable for 
factoring symmetric indefinite matrices on high-performance architectures 
were given. The third algorithm, called the a priori strategy, was deter- 
mined to be superior to the other two and was implemented on two high- 
performance architectures, the Cray Y-MP and the Convex C-220. The a pri- 
ori algorithm was shown to be significantly faster than the Bunch-Kaufman 
algorithm. 

The a priori algorithm is also suitable for implementation on parallel 
architectures because it allows the use of matrix-matrix operations, rather 
than the matrix-vector operations used in the Bunch-Kaufman algorithm. 
The use of the a priori algorithm will increase the ratio of computation to 
synchronization. The authors are currently implementing this algorithm on 
a parallel architecture. 

Another possible application for the a priori strategy is the factorization 
of sparse matrices. A variant of the Bunch-Kaufman algorithm that includes 
pivoting to preserve sparsity is given in [3]. A modified version of the a priori 
strategy may improve the performance of this algorithm on high-performance 
architectures. 
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